May 27, 2015
Concern from the statistics community
Former Googler Rachel Schutt taught a similar Data Science course at Columbia University.
Only intro stats (know what a standard error and regression are) and some exposure to R
18 students, mostly juniors and seniors.
| Major | Count |
|---|---|
| Mathematics | 4 |
| Biological Science: Biology & Biochem and Molecular Biology | 4 |
| Other Science: Chemistry, Environmental Studies, Physics | 4 |
| Social Science: Political Science, Sociology | 2 |
| Economics | 2 |
| Misc: Psychology, Linguistics | 2 |
ASA's GAISE Reports
- dplyr package for data wrangling
- ggplot2 package for data visualization

We set the restriction that all our data exists in a matrix called a data frame, which we say has the "tidy" property:
Most data manipulations can be achieved by applying the following verbs to "tidy" data:
- filter: keep rows matching criteria
- summarise: reduce variables to values
- mutate: add new variables
- arrange: reorder rows
- select: pick columns by name
- join: join two data frames
- group_by: group subsets of observations together

The pipe %>% command, pronounced "then".
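To make the verbs and the pipe concrete, here is a minimal sketch in R. The two small data frames and all of their column names and values are made up for illustration; they are not the course data.

```r
library(dplyr)

# Hypothetical toy data (invented values, for illustration only)
flights <- data.frame(
  carrier   = c("AA", "AA", "UA", "UA", "WN"),
  dest      = c("DFW", "ORD", "ORD", "SFO", "DAL"),
  dep_delay = c(5, 20, -3, 45, 10),
  arr_delay = c(2, 30, -5, 50, 12),
  stringsAsFactors = FALSE
)
airports <- data.frame(
  dest = c("DFW", "ORD", "SFO", "DAL"),
  city = c("Dallas", "Chicago", "San Francisco", "Dallas"),
  stringsAsFactors = FALSE
)

# Read each %>% as "then": take flights, then filter, then mutate, ...
delay_by_carrier <- flights %>%
  filter(dep_delay > 0) %>%                     # keep late departures only
  mutate(gained = dep_delay - arr_delay) %>%    # add a new variable
  group_by(carrier) %>%                         # group rows by carrier
  summarise(mean_gained = mean(gained),         # reduce each group to one row
            n_flights = n()) %>%
  arrange(desc(mean_gained))                    # reorder rows

# select and join: pick two columns, then attach the destination city
flights_with_city <- flights %>%
  select(carrier, dest) %>%
  left_join(airports, by = "dest")

# The nested call and the piped call give the same result
same <- identical(
  arrange(filter(flights, dep_delay > 0), dep_delay),
  flights %>% filter(dep_delay > 0) %>% arrange(dep_delay)
)
```

The piped version reads top to bottom in the order the steps happen, which is the main reason to prefer it over nesting.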
For example: say you want to apply functions h() and g() and then f() on data x. You can do
f(g(h(x))) OR
h(x) %>% g() %>% f()

A statistical graphic consists of a mapping of variables in data to aesthetic attributes of geometric objects that we can observe.
ggplot2 allows us to construct graphics in a modular fashion by specifying these components.
| Data (Variable) | Aesthetic | Geometric Object |
|---|---|---|
| longitude | x position | points |
| latitude | y position | points |
| army size | size = width | bars |
| army direction | color = brown or black | bars |
| date | (x,y) position | text |
| temperature | (x,y) position | lines |
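A mapping like the one in the table can be sketched in ggplot2 as follows. The data frame `troops` and its values are hypothetical placeholders, not the data behind the graphic described above, and points stand in for every geometric object to keep the sketch short.

```r
library(ggplot2)

# Hypothetical placeholder data (invented values, for illustration only)
troops <- data.frame(
  longitude = c(24, 28, 32, 36),
  latitude  = c(55.0, 54.5, 55.5, 55.8),
  army_size = c(400, 300, 150, 100),
  direction = c("advance", "advance", "retreat", "retreat")
)

# Each aes() argument maps one variable to one aesthetic attribute;
# geom_point() supplies the geometric object that displays them.
p <- ggplot(data = troops,
            mapping = aes(x = longitude, y = latitude,
                          size = army_size, colour = direction)) +
  geom_point()

# Further components are added modularly with +, for example labels
p <- p + labs(x = "Longitude", y = "Latitude")
```

Swapping `geom_point()` for another geom changes the geometric object while leaving the data and the aesthetic mapping untouched.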
Dataset consisting of all 227,496 domestic flights leaving Houston airport (IAH) in 2011. Four data frames:
- flights: flight info
- weather: hourly weather info
- planes: information on all 2853 airplanes
- airports: destination airport information

Rennie Meyers
Will Jones
Sample of 10% of San Francisco OkCupid users in June 2012 (\(n=5995\)). 40.2% of the population was female.
Goal was to use logistic regression to predict gender.
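As a rough sketch of what such a model looks like in R, the following fits a logistic regression with `glm()`. The data are simulated, and the single predictor `height` is an assumption chosen for illustration; neither comes from the students' project or from the OkCupid sample.

```r
set.seed(1)
n <- 500

# Simulated stand-in data (NOT the OkCupid sample); height is an assumed predictor
is_female <- rbinom(n, size = 1, prob = 0.4)
height <- rnorm(n, mean = ifelse(is_female == 1, 64, 70), sd = 3)
profiles <- data.frame(is_female = is_female, height = height)

# Logistic regression: model the log-odds of being female as linear in height
fit <- glm(is_female ~ height, data = profiles, family = binomial)

# Fitted probabilities, then a 0/1 prediction using a 0.5 cutoff
profiles$p_hat <- predict(fit, type = "response")
profiles$predicted <- as.integer(profiles$p_hat > 0.5)

# Share of profiles classified correctly (in-sample, so optimistic)
accuracy <- mean(profiles$predicted == profiles$is_female)
```

The 0.5 cutoff is a convention rather than a requirement; with an unbalanced sample a different threshold may be more sensible.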
Miguel Connor
Many students
Data visualization is a backdoor way to get students interested in statistics.
Trick them into thinking they're not learning, and they learn.
This is the only stats class many will take.
Presentation on 2011/06/27 given by Dierdre and Amir:
In the repositories section of github.com/rudeboybert